Networks are complex structures used to represent relations. In math they are studied as graphs, but they can be used in multiple ways. As a way to represent knowledge, to define a model, as database systems, or to represent information. Many machine learning models are built as graphs: Neural networks, decision trees, and even causal models. In this course, we will focus on networks for social science. Here, we will use graphs to represent data about social relations between entities (people, institutions, countries, etc.).
In order to work with networks, we first need to go through some basic definitions.
\[ G=(V, E) \] Where \(V\) is the set of nodes, and \(E\) the set of links.
\(V\): {A,B,C}, \(E\): {1,2}
A node that has a link that points to itself is called a loop.
A walk is a sequence of nodes and links, like
(B, 1, A, 2, C) in our simple network, where the links
between nodes are those that connect them, so it can also be represented
as (1, 2).
A path is a walk where all the links are different (it does not go through the same link twice).
The length of a path is its number of links, and the shortest path2 There can be more than one shortest path between a pair of nodes. between two nodes is the one that connects them and has the least number of links.
The distance3 If there is no path between the two nodes, we can say that the distance is \(\infty\). between two nodes is given by the length of the shortest path that connects them.
A sub-graph \(G'=(V', E')\) is a subset of \(G\) where \(V'\) is a subset of \(V\) and \(E'\) is a subset of \(E\) that only contains links that connect nodes in \(V'\).
An ego network is obtained when we take a node and build the sub-graph that contains all the nodes that are connected to it, and all the respective links that connect them.4 We can also build the ego network that contains all the nodes at a distance of 2, 3, etc.
Two nodes are connected if there is path between them. If all the nodes of the network are connected, then it is a connected network.
The relation between two nodes can either be dichotomous or quantitative. For example, a link can represent co-authorship – given two authors, they either co-authored a paper or not– and for it we might use an unweighted network. But a link can also represent the number of papers two co-authors wrote together, and in this case we might use a weighted network. A weighted network can be represented as an unweighted one for those relations above a given threshold.
Connections can have a direction5 This changes a little the definition of walks and paths,
as they now need to follow the direction of links. Also the distance
between two nodes can be asymmetrical.. For example, in Twitter, B
can follow A, who follows C, that also follows
A back. While in Facebook, the connections are defined as
friendship, where there is no direction.
Networks can encode complex relations between different types of entities. For example, if we want to represent people that belongs to different institutions, we can have nodes that represent people {\(A_1,A_2,...,A_5\)}, and other type of nodes that represent institutions {\(I_1,I_2,...,I_4\)}, and here the links represent the relation of belonging.
This type of networks can also be simplified using projection, where we keep only one type of node. In the example above, we can relate institutions to each other if they have people in common, and link people if they belong to the same institution:
If we want to use network analysis to study a social process, a static representation can fall short to understand how the social interactions evolve over time. For this, we can also think networks as a dynamic process, where new nodes enter, old nodes disappear, and links change over time.
Lets take a small random network and see how it can be represented and which metrics can be used to describe it:
## 1 2 3 4 5 6 7 8 9 10
## 1 0 0 0 0 1 0 1 1 0 0
## 2 0 0 0 0 0 0 0 0 0 1
## 3 0 0 0 1 0 0 0 0 1 0
## 4 0 0 1 0 1 1 0 0 1 0
## 5 1 0 0 1 0 0 1 1 0 0
## 6 0 0 0 1 0 0 1 0 1 0
## 7 1 0 0 0 1 1 0 0 0 1
## 8 1 0 0 0 1 0 0 0 0 0
## 9 0 0 1 1 0 1 0 0 0 0
## 10 0 1 0 0 0 0 1 0 0 0
## [,1] [,2]
## [1,] 3 4
## [2,] 1 5
## [3,] 4 5
## [4,] 4 6
## [5,] 1 7
## [6,] 5 7
## [7,] 6 7
## [8,] 1 8
## [9,] 5 8
## [10,] 3 9
## [11,] 4 9
## [12,] 6 9
## [13,] 2 10
## [14,] 7 10
## $`1`
## + 3/10 vertices, named, from 9df1f50:
## [1] 5 7 8
##
## $`2`
## + 1/10 vertex, named, from 9df1f50:
## [1] 10
##
## $`3`
## + 2/10 vertices, named, from 9df1f50:
## [1] 4 9
A key concept in network analysis is centrality. It refers to the importance of a node within the network. Generally speaking, if a node is well connected, it can exercise more influence over the network.
Let’s imagine our network is composed of a group of 10 people. Links represent friendship and trust between them. If we want to know how fake news are spread across the network, centrality measures can help us to understand what would happen if any of the nodes start a rumor8 This same kind of analysis can be made for the spread of diseases, political beliefs, among many different research questions..
Who do you think that can be more effective spreading the misinformation? Node 4, 7 or 2?
There are many ways to define the influence of a node, and which one is the best depends on the research question.
## # A tibble: 10 × 2
## node degree
## <int> <dbl>
## 1 1 3
## 2 2 1
## 3 3 2
## 4 4 4
## 5 5 4
## 6 6 3
## 7 7 4
## 8 8 2
## 9 9 3
## 10 10 2
We can see that by degree centrality nodes 4, 5, and 7 are considered equally important.
\[ closeness(i) = \frac{N-1}{\sum_{j\neq i}d_{i,j}} \]
## # A tibble: 10 × 2
## node closeness
## <int> <dbl>
## 1 5 0.067
## 2 7 0.067
## 3 4 0.059
## 4 6 0.059
## 5 1 0.056
## 6 9 0.05
## 7 10 0.048
## 8 8 0.045
## 9 3 0.042
## 10 2 0.034
When we consider closeness, node 4 is not as important as in degree centrality. As it is not connected to node 7, its distance to nodes 10 and 2 is rather long.
\[ betweenness(i)= \sum_{j,k\neq i}\frac{b_{jik}}{b_{jk}} \]Where \(b_{jk}\) are all the shortest paths between nodes \(j\) and \(k\), and \(b_{jik}\) are all of those shortest paths that go through \(i\).
## # A tibble: 10 × 2
## node betweenness
## <int> <dbl>
## 1 7 16.7
## 2 5 10.2
## 3 4 8.83
## 4 10 8
## 5 6 7
## 6 1 1.83
## 7 9 1.5
## 8 2 0
## 9 3 0
## 10 8 0
For betweenness, the node 7 is the most important by far, as it is the only bridge to nodes 10 and 2, and one of the two connections between nodes {5, 1, 8} and nodes {6, 9, 4, 3}.
This is a recursive problem, as the centrality can be defined as
\[ x_i = \kappa^{-1} \sum_{\text{nodes j connected to i}}x_j \]
The centrality of the node \(x_i\) is some proportion of the centrality of its neighbors (\(\kappa\) being a proportionality constant).
We can use the adjacency matrix \(A\) to redefine this problem, given that if the nodes are connected, \(A_{ij}\) is 1, and 0 otherwise.
\[ x_i = \kappa^{-1} \sum_{j}^nA_{ij}x_j \] It is called eigen centrality, because the solution to this equation is the eigen vector of the adjacency matrix10 In algebra this is defined as \(Ax=\lambda x\), where \(\lambda\) is the eigen value and \(x\) the eigen vector that solve this equation. For the scope of this class, we can just think that this gives us the solution of our recursive problem..
## # A tibble: 10 × 2
## node eigen_centrality
## <int> <dbl>
## 1 5 1
## 2 4 0.96
## 3 7 0.91
## 4 6 0.81
## 5 1 0.77
## 6 9 0.72
## 7 8 0.56
## 8 3 0.52
## 9 10 0.32
## 10 2 0.1
When we consider eigen centrality, node 5 is more important than node 7. This happens because node 5 is connected to both nodes 4 and 7, which are the two following nodes in importance, while nodes 4 and 7 are not connected between them.
The preferential attachments mechanism for building networks is an interesting model to rethink inequalities. The time component allows to think how if a small difference happens repeatedly over time, and has a cumulative nature, it can create an outcome with extreme inequalities. This model also shows that those who enter first to the system of accumulation already have a unbeatable advantage with respect to those that enter afterwards.
Income has such cumulative mechanism. Money makes money. If a group of the population was systematically deprived for generations, even if there seems to be equal opportunities at some point in time, the cumulative nature of income will continue to affect that group. This is why colorblind policies are not enough, and without restorative actions reaching equality is impossible.
Homophily is one of the biggest sociological ideas that came up from network studies. A classical example has always been the friendship network in a US high school, which shows how black and white kids tend to play with other kids that share their identity (Bearman, Moody, and Stovel 2002). Nevertheless, this concept is also problematic, and without enough care it can foster misleading conclusions.
When we talk about identities, there are privileged and excluded populations. The concept of homophily equas all groups to a single dynamic, when in reality there are different phenomena happening for different groups.
The empirical observation of a high modularity needs to be properly contextualized. An extreme example would be the residential segregation in US. In the 30’s, the Federal Housing Administration made public housing projects exclusively for white people. If we would measure the racial segregation by neighborhoods at that time, we could think that both black and white people showed homophily behaviors. But black people did not had a choice, as they were excluded from white neighborhoods. The reasons behind the behavior of each group are strikingly different, and modularity alone cannot capture that complexity.
Social networks
Random networks
Networks are complex structures. There are no simple tests like in descriptive statistics that we can use to check if a network follows a property, but random networks can be used to compare some properties. Random networks are generative models, where we can specify some parameters like the number of nodes, edges, or the probability of a link between nodes, and a network is built following a specific algorithm.
Preferential attachment model
These models can also help us infer which are the evolutive properties of a network. For example, the preferential attachment model iteratively adds new nodes to the network. This new nodes have a probability to connect with old nodes that is given by the degree of the old nodes. This iterative process recreates the rich gets richer effect.
One type of preferential attachment model is the Barabási-Albert (BA) model. The algorithm of the BA model works as follows:
\[ p_{i}\sim k_i^\alpha \]Where \(k_i\) is the degree of the node \(i\) and \(\alpha\) is a parameter that defines the power of the preferential attachment.
Let’s see how this process looks like:
Let’s see the degree distribution of such type of networks:
The centrality follows a power-law distribution, where a few nodes have a really high centrality, while the great majority has a low centrality.
Small world
Another property that is commonly found in real-world networks is the small world phenomena: as networks get bigger in number of nodes, the average distance of the network only increases mildly. Huge networks with millions of nodes have very short distances between any pair of nodes, even if it is sparsely connected12 “Six degrees of separation”: is the idea that all people are six or fewer social connections away from each other. Although this number is somehow forced, in Facebook the average distance between any pair of users is 5.73, and in twitter 4.67..
This happens because of a combination of clustering and power-law degree distributions: there are a few highly connected nodes that guarantee the more distant connections, and then the densely connected neighborhoods easily connect the node with that highly connected node in its neighborhoods.
Communities
Detecting communities within a network is a common task in SNA.
Finding Twitter communities, for example, can be the key for understanding the patterns of political polarization (Conover et al. 2011). A community in a network is a sub-graph in which the nodes are densely connected within them, and sparsely connected to nodes outside their community.
It is the same concept as clustering, but the techniques used to detect them are different, given the particular structure of network’s data.
Modularity (\(Q\)) (Newman 2006) is an important metric for community detection. It shows the extent to which the number of links between a group of nodes is greater (\(Q>0\)) or smaller (\(Q<0\)) than expected at random in that graph.
Let’s see how our networks looks like if we build communities using Louvain algorithm:
These are just two examples, but there are many other algorithms, and which one is the best fit depends on the research question.
Homophily
One of the main findings across several studies is the concept of homophily: the idea that people tend to relate more to others that they perceive similar to them in some way.
A great example is the political homophily: people tend to relate more with people that share their political beliefs. For example, the figure below shows the Twitter network of Argentina.
Twitter Argentina (González 2016)
Twitter US also shows a similar split between democrats and republicans.
Twitter US (Ribeiro et al., n.d.)
To measure homophily, we can also use the metric of modularity, but instead of measuring the groups built as a community detection algorithm, we use it with taking the nodes grouped by an attribute attached to each one of them (e.g. democrat or republican).